Load the package and the data
library(RColorBrewer) ## nicer color schemes
library(xcms) ## the package doing the job
library(tidyverse) ## potentially useful
This library is used only to get some raw data to play with …
library(faahKO)
we will analyze a subset of the data from in which the metabolic consequences of knocking out the fatty acid amide hydrolase (FAAH) gene in mice was investigated. The raw data files (in NetCDF format) are provided with the faahKO data package. The data set consists of samples from the spinal cords of 6 knock-out and 6 wild-type mice. Each file contains data in centroid mode acquired in positive ion mode form 200-600 m/z and 2500-4500 seconds. To speed up processing of this vignette we will restrict the analysis to only 8 files and to the retention time range from 2500 to 3500 seconds.
Note A large parte of this tutorial is taken from the official vignette of xcms. Many thanks to Steffen Neumann and Juhannes Rainer!
We start getting some raw data into R.
## Get the full path to the CDF files
cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]
cdfs
## [1] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/KO/ko15.CDF"
## [2] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/KO/ko16.CDF"
## [3] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/KO/ko21.CDF"
## [4] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/KO/ko22.CDF"
## [5] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/WT/wt15.CDF"
## [6] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/WT/wt16.CDF"
## [7] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/WT/wt21.CDF"
## [8] "/home/franceschp/R/x86_64-pc-linux-gnu-library/4.0/faahKO/cdf/WT/wt22.CDF"
As you can see we have four injections belonging to two classes, ko and wt.
In the last few years the xcms developer has been making a big effort to make their package coherent with a general framework for the handling of MS data in R (metabolomics, proteomics, …)
all this goes beyond the scope of our course, for us is sufficient to know that this infrastructure allows to store sample “metadata” (e.g. treatment class, time point, etc) together with the raw experimental data.
In our specific case, the tibble with the phenotypic data could be designed as follow
phenodata <- tibble(sample_name = sub(basename(cdfs), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = c(rep("KO", 4), rep("WT", 4)))
phenodata
## # A tibble: 8 x 2
## sample_name sample_group
## <chr> <chr>
## 1 ko15 KO
## 2 ko16 KO
## 3 ko21 KO
## 4 ko22 KO
## 5 wt15 WT
## 6 wt16 WT
## 7 wt21 WT
## 8 wt22 WT
Up to now nothing has been actually loaded into R. To do that
raw_data <- readMSData(files = cdfs,
pdata = new("NAnnotatedDataFrame", phenodata), ## this is the structure of xcms holding phenotypic data
mode = "onDisk")
We next restrict the data set to the retention time range from 2500 to 3500 seconds, just to spare some time …
raw_data <- filterRt(raw_data, c(2500, 3500))
The raw_data object contains the full set of 2D data collected in all my 8 samples. The “raw” values can be extracted by using
rt <- rtime(raw_data)
mz <- mz(raw_data)
I <- intensity(raw_data)
Let’s look to the structure of thse three objects
glimpse(rt)
## Named num [1:5112] 2501 2503 2505 2506 2508 ...
## - attr(*, "names")= chr [1:5112] "F1.S0001" "F1.S0002" "F1.S0003" "F1.S0004" ...
These are the retention time in seconds for the chromatography of all the 8 files. “F!.S0001” stands for File1, scan number 1 … it was recorded at 2501 seconde …
Another way to see that
plot(rt)
Where we see the increasing time scale for each file.
mz and I holds the mass spectra collected at each scantime … for this reason the two objects are lists and not vectors. Remember our data are 2d. For each scantime we have a complete mass spectrum
## only the first 20
glimpse(mz[1:20])
## List of 20
## $ F1.S0001: num [1:429] 200 201 202 203 204 ...
## $ F1.S0002: num [1:431] 200 201 202 203 204 ...
## $ F1.S0003: num [1:431] 200 201 202 203 204 ...
## $ F1.S0004: num [1:427] 200 201 202 203 204 ...
## $ F1.S0005: num [1:422] 200 201 202 203 204 ...
## $ F1.S0006: num [1:433] 200 201 202 203 204 ...
## $ F1.S0007: num [1:430] 200 201 202 203 204 ...
## $ F1.S0008: num [1:425] 201 202 203 204 205 ...
## $ F1.S0009: num [1:434] 201 202 203 204 205 ...
## $ F1.S0010: num [1:425] 201 202 203 204 205 ...
## $ F1.S0011: num [1:435] 201 202 203 204 205 ...
## $ F1.S0012: num [1:431] 201 202 203 204 205 ...
## $ F1.S0013: num [1:438] 201 202 203 204 205 ...
## $ F1.S0014: num [1:445] 201 202 203 204 205 ...
## $ F1.S0015: num [1:429] 201 202 203 204 205 ...
## $ F1.S0016: num [1:427] 201 202 203 204 205 ...
## $ F1.S0017: num [1:428] 201 202 203 204 205 ...
## $ F1.S0018: num [1:431] 201 202 203 204 205 ...
## $ F1.S0019: num [1:435] 201 202 203 204 205 ...
## $ F1.S0020: num [1:425] 201 202 203 204 205 ...
## only the first 20
glimpse(I[1:20])
## List of 20
## $ F1.S0001: num [1:429] 1716 1723 2814 1961 667 ...
## $ F1.S0002: num [1:431] 1502 1930 2893 1753 699 ...
## $ F1.S0003: num [1:431] 1453 1828 3145 1521 728 ...
## $ F1.S0004: num [1:427] 1650 1655 3499 1414 656 ...
## $ F1.S0005: num [1:422] 1807 1589 3686 1285 590 ...
## $ F1.S0006: num [1:433] 1698 1732 3778 1219 626 ...
## $ F1.S0007: num [1:430] 1486 2089 3938 1236 893 ...
## $ F1.S0008: num [1:425] 2280 4256 1333 1135 1628 ...
## $ F1.S0009: num [1:434] 2374 4358 1318 1226 1560 ...
## $ F1.S0010: num [1:425] 2539 4160 1226 1131 1453 ...
## $ F1.S0011: num [1:435] 3043 3844 1144 1002 1466 ...
## $ F1.S0012: num [1:431] 3730 3943 1211 858 1432 ...
## $ F1.S0013: num [1:438] 4521 4384 1318 779 1281 ...
## $ F1.S0014: num [1:445] 5422 4793 1439 733 1249 ...
## $ F1.S0015: num [1:429] 6319 4884 1534 782 1219 ...
## $ F1.S0016: num [1:427] 7074 4872 1677 871 1171 ...
## $ F1.S0017: num [1:428] 7600 4702 1624 985 1157 ...
## $ F1.S0018: num [1:431] 7726 4443 1476 1013 1150 ...
## $ F1.S0019: num [1:435] 7646 4142 1381 936 1144 ...
## $ F1.S0020: num [1:425] 7512 3877 1308 804 1196 ...
We can plot a complete spectrum (here the first scan of the first sample …)
plot(mz$F1.S0001,I$F1.S0001, type = "h", main = names(mz)[1])
Ok, working with all files together is not the best … for visualization and handling. To facilitate the “cutting” by file, xcms is provided with a split() function which can be combined with a fromFile function to create a list with the content separate by file
by_file_mz <- split(mz, fromFile(raw_data))
by_file_rt <- split(rt, fromFile(raw_data))
by_file_I <- split(I, fromFile(raw_data))
length(by_file_mz)
## [1] 8
As we discussed, metabolites are visible as peaks ia 2d mz/rt plane …
mytibble <- tibble(rt = by_file_rt[[2]], mz = by_file_mz[[2]], I = by_file_I[[2]])
mytibble
## # A tibble: 639 x 3
## rt mz I
## <dbl> <named list> <named list>
## 1 2501. <dbl [424]> <dbl [424]>
## 2 2503. <dbl [423]> <dbl [423]>
## 3 2505. <dbl [433]> <dbl [433]>
## 4 2506. <dbl [427]> <dbl [427]>
## 5 2508. <dbl [423]> <dbl [423]>
## 6 2509. <dbl [427]> <dbl [427]>
## 7 2511. <dbl [427]> <dbl [427]>
## 8 2512. <dbl [431]> <dbl [431]>
## 9 2514. <dbl [425]> <dbl [425]>
## 10 2515. <dbl [427]> <dbl [427]>
## # … with 629 more rows
jet.colors <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
mytibble %>%
unnest(c("mz","I")) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.1) +
scale_color_gradientn(colours = jet.colors(7)) +
theme_light()
In the first phase of the inspection is useful to visualize the “total” chromatograpic current …
## Get the total ion chromatograms. This reads data from the files.
tics<- chromatogram(raw_data, aggregationFun = "sum")
## Define colors for the two groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:2], "60")
names(group_colors) <- c("KO", "WT")
## Plot all chromatograms.
plot(tics, col = group_colors[raw_data$sample_group])
## raw_data$sample_group extracts the info on the phenotipic data inside the raw_data
As you can see, nothing happens after 4200 seconds … we also appreciate the variability of the results.
This visualization already shows you how it will be tricky to “match” the different samples. Some of the peaks are present everywhere, but others show-up only in specific samples …
The overall integral of the signal in each sample is often used as a way to spot unexpected analytical drifts
## here we rely on the old (and efficient) R style
total_I <- sapply(tics, function(x) sum(intensity(x)))
plot(total_I, col = factor(raw_data$sample_group), pch = 19, cex = 2)
Here the response is comparable, but if this would be the injection order, we should highlight the absence of a proper randomization of the samples!
I know that a specific metabolite present in my samples yields an ion @mz335 … let’s look to the profile of this ion signal over the chromatographic time
## here we get the traces ... compare the function with the one used for the TICs
ion_I_know <- chromatogram(raw_data, mz = c(334.9, 335.1))
plot(ion_I_know, col = group_colors[raw_data$sample_group])
The previous plot is important. It is tellin us that the metabolite_I_know is present in the sample and is released by the chromatographic column at around 2800 sec … There it is producing a peal in the signal of the ion_I_know @mz335
To automatically find metabolites in my data I have to teach a computer to look for peaks in the chromatographic traces of all possible ions
bin - see ?bin for details)The “older” and most sounding way of finding peaks implemented in xcms is the matched filter algorithm.
A full description of the parametrs of the algorithm can be found in the xcms manual, here we focus on
In xcms the parameters of the algorithm are stored into aspecific object
mf <- MatchedFilterParam(binSize = 0.1,
fwhm = 30 ,
snthresh = 6)
mf
## Object of class: MatchedFilterParam
## Parameters:
## binSize: 0.1
## impute: none
## baseValue:
## distance:
## fwhm: 30
## sigma: 12.73994
## max: 5
## snthresh: 6
## steps: 2
## mzdiff: 0.6
## index: FALSE
Now I can use the previous parameters to find the peaks in one sample
first_file <- readMSData(files = cdfs[1],
mode = "onDisk") %>%
filterRt(c(2500, 3500))
first_peaks <- findChromPeaks(first_file,param = mf)
The actual list of peaks can be extracted from the previous object by the method chromPeaks
Let’s look to the head of the output
first_peak_table <- chromPeaks(first_peaks)
dim(first_peak_table)
## [1] 464 13
head(first_peak_table, 5)
## mz mzmin mzmax rt rtmin rtmax into intf maxo
## CP001 200.1000 200.1 200.1 2928.610 2912.961 2942.695 147887.5 290507.1 9687
## CP002 201.0638 201.0 201.1 2531.112 2515.463 2549.892 204572.4 273087.9 7726
## CP003 205.0000 205.0 205.0 2784.635 2770.550 2800.284 1778568.9 3610061.8 84280
## CP004 205.9819 205.9 206.0 2786.200 2772.115 2800.284 237993.6 448580.2 10681
## CP005 207.0821 207.0 207.1 2712.647 2698.562 2726.731 380873.0 730981.3 18800
## maxf i sn sample
## CP001 15899.06 1 9.615395 1
## CP002 13172.68 1 9.050401 1
## CP003 195026.47 1 48.163309 1
## CP004 23860.10 1 24.225607 1
## CP005 40065.74 1 18.000238 1
The first two numbers are telling us that with the setting we have been choosing we were able to find 464 peaks
The help of xcms describes the most relevant columns of the table
“mz” (intensity-weighted mean of mz values of the peak across scans/retention times), “mzmin” (minimal mz value), “mzmax” (maximal mz value), “rt” (retention time of the peak apex), “rtmin” (minimal retention time), “rtmax” (maximal retention time), “into” (integrated, original, intensity of the peak), “maxo” (maximum intentity of the peak), “sample” (sample index in which the peak was identified)
This is the map of the identified peaks superimposed to the “real” experimental data
peaks_tibble <- as_tibble(first_peak_table)
tibble(rt = by_file_rt[[1]], mz = by_file_mz[[1]], I = by_file_I[[1]]) %>%
unnest(c("mz","I")) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.1) +
geom_point(data = peaks_tibble, mapping = aes(x = rt, y = mz), col = "orange", alpha = 0.5) +
scale_color_gradientn(colours = jet.colors(7)) +
theme_light()
Peak piching can also be performed with another algorithm … CentWave …
cwp <- CentWaveParam(peakwidth = c(20, 80),
ppm = 30,
prefilter = c(3, 5000))
Also here many parameters (and others are not mentioned). I highlight here some of them
If we run the peak picking with this new algorithm …
first_peaks_cw <- findChromPeaks(first_file,param = cwp)
first_peak_table_cw <- chromPeaks(first_peaks_cw)
dim(first_peak_table_cw)
## [1] 353 11
head(first_peak_table_cw, 5)
## mz mzmin mzmax rt rtmin rtmax into intb maxo
## CP001 453.2 453.2 453.2 2509.203 2501.378 2527.982 1007409.0 1007380.8 38152
## CP002 236.1 236.1 236.1 2518.593 2501.378 2537.372 253501.0 228013.2 12957
## CP003 594.0 594.0 594.0 2601.535 2581.191 2637.529 161042.2 149109.3 7850
## CP004 577.0 577.0 577.0 2604.665 2581.191 2626.574 136105.2 130646.0 6215
## CP005 369.2 369.2 369.2 2587.451 2556.151 2631.269 483852.3 483777.1 7215
## sn sample
## CP001 38151 1
## CP002 12 1
## CP003 13 1
## CP004 16 1
## CP005 7214 1
As you see the number of columns is different, but the key infos are there. Remarkably we have been able to find less peaks
peaks_tibble_cw <- as_tibble(first_peak_table_cw)
tibble(rt = by_file_rt[[1]], mz = by_file_mz[[1]], I = by_file_I[[1]]) %>%
unnest(c("mz","I")) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.1) +
geom_point(data = peaks_tibble_cw, mapping = aes(x = rt, y = mz), col = "orange", alpha = 0.5) +
scale_color_gradientn(colours = jet.colors(7)) +
theme_light()
If we superimpose them …
tibble(rt = by_file_rt[[1]], mz = by_file_mz[[1]], I = by_file_I[[1]]) %>%
unnest(c("mz","I")) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = sqrt(I)), size = 0.1, alpha = 0.5) +
geom_point(data = peaks_tibble_cw, mapping = aes(x = rt, y = mz), col = "red", alpha = 0.7) +
geom_point(data = peaks_tibble, mapping = aes(x = rt, y = mz), col = "green", alpha = 0.7, shape = 1) +
scale_color_gradientn(colours = jet.colors(7)) +
theme_light()
The difference is striking.
Obviously one could fiddle around with the parameters to look for a more coherent picture, but the difference is not unexpected considering the fact that we are dealing with two different approaches
When we are satisfied with a set of peak picking parameters, the algorithm will be sequentially run on all the files of the dataset resulting in a large list of peaks assigned to the different samples.
xdata <- findChromPeaks(raw_data, param = cwp)
Here a table of the peaks found in all files
table(chromPeaks(xdata)[,"sample"])
##
## 1 2 3 4 5 6 7 8
## 353 426 191 211 382 254 216 280
An overall representation of their distribution in the plane is extremely interesting
chromPeaks(xdata) %>%
as_tibble() %>%
ggplot() +
geom_point(aes(x = rt, y = mz), siaze = 0.3) +
facet_wrap(~sample) +
theme_light()
As you can see the samples are different, but the overall “look and feel” is coherent. This is telling us that the overall analytical run was good.
I was metioning retention time shifts …
chromPeaks(xdata) %>%
as_tibble() %>%
filter(sample %in% c(1,8)) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) +
scale_color_brewer(palette = "Set1") +
theme_light()
… The shift is clear! It is small, but it is visible. the shift is responsible of a difference in the samples coming from the analysis and not the biology. This shift has to be corrected to avoid biased results
xcms can do much more to browse and characterize the peaks, but here we want to focus on the key ideas.
In summary:
The alignment step, also referred to as retention time correction, aims at adjusting this by shifting signals along the retention time axis to align the signals between different samples within an experiment.
Also here a plethora of approaches is available. As usual, everything will work better if the chormatography is more reproducible (for GC, for example, retention time correction is often not necessary).
In xcms the most used and reliable method for alignment of high resolution experiments is based on the obiwarp approach. The algorithm was developed for proteomics and is based on dynamic time warping.
The alignment is performed directly on the profile-matrix and can hence be performed independently of the peak detection or peak grouping.
xdata <- adjustRtime(xdata, param = ObiwarpParam(binSize = 0.2))
It is of utmost importance to check the amount of correction since large time shifts are not reasonable
plotAdjustedRtime(xdata, col = group_colors[xdata$sample_group])
The previous plot shows the extent of correction of the different samples (one is not corrected since is considered the reference).
As you can see the correction is never bigger than 15 seconds. With a chromatographic peak width of around 30 seconds this is more than acceptable and, another time it speaks of a overall good analytical reproducibility
xdata now still contains the list of the peaks for the different samples, but now they retention time should be less erratic …
chromPeaks(xdata) %>%
as_tibble() %>%
filter(sample %in% c(1,8)) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = factor(sample)), siaze = 0.3) +
scale_color_brewer(palette = "Set1") +
theme_light()
As you can see the situation has improved and some of the verticl stripes are now well aligned.
The last step is to find a consensus list of variable across the different samples. The list of peaks is now aligned in retention time but: